Project 1

DOMAIN: Healthcare

CONTEXT: Medical research university X is conducting in-depth research on patients with certain conditions. The university has an internal AI team. For confidentiality, the patients' details and conditions are masked: the client provides anonymised datasets to the AI team for developing an AI/ML model that can predict a patient's condition from the received test results.

DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the dataset by six biomechanics attributes derived from the shape and orientation of the condition relative to the affected body part.

1. P_incidence
2. P_tilt
3. L_angle 
4. S_slope
5. P_radius 
6. S_degree
7. Class


PROJECT OBJECTIVE: Demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms.

Import and warehouse data:

Import all the given datasets and explore shape and size of each.
Merge all datasets onto one and explore final shape and size.
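A minimal sketch of this import-and-merge step, using pandas (the file names below are hypothetical, since the real dataset names are masked by the client):

```python
import pandas as pd

def load_and_combine(paths):
    """Read each CSV, report its shape, then stack them into one dataframe."""
    frames = [pd.read_csv(p) for p in paths]
    for path, frame in zip(paths, frames):
        print(path, frame.shape)
    combined = pd.concat(frames, ignore_index=True)  # re-index after stacking
    print("final shape:", combined.shape)
    return combined

# e.g. df = load_and_combine(["part1.csv", "part2.csv", "part3.csv"])
# (file names are hypothetical -- the real ones were masked)
```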

Observation:

Three dataframes are created, one from each input file.
The final merged dataframe has 310 records, covering the Normal, Type_H and Type_S classes.

Data cleansing:

Explore and, if required, correct the datatypes of each attribute.
Explore the attributes for null values and, if required, drop or impute them.
Class is of object dtype; we need to change the datatype of this column.
There are no null values.
Running the sampling cell above multiple times helps in viewing different subsets of the data.
It looks like we need to take a closer look at the "Class" column.
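The datatype correction and null check can be sketched as follows (a minimal illustration, not the notebook's exact cell; the helper name is ours):

```python
import pandas as pd

def basic_cleanse(df):
    """Cast the object-typed Class column to a categorical dtype and
    report the null count per attribute."""
    df = df.copy()
    if "Class" in df.columns:
        df["Class"] = df["Class"].astype("category")
    return df, df.isnull().sum()
```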

Observation:

The Class column needs transformation; we need to correct its values.
Here TypeH and type_h, tp_s and Type_S, and Normal and Nrmal represent the same classes.
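One way to apply this correction with pandas (the mapping covers exactly the inconsistent spellings listed above):

```python
import pandas as pd

# Map every inconsistent spelling seen in the data to a canonical label.
LABEL_MAP = {"TypeH": "Type_H", "type_h": "Type_H",
             "tp_s": "Type_S",
             "Nrmal": "Normal"}

def normalize_class(df):
    """Return a copy of the dataframe with the Class labels corrected."""
    df = df.copy()
    df["Class"] = df["Class"].replace(LABEL_MAP)
    return df
```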

Data analysis & visualisation:

Perform detailed statistical analysis on the data.
Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis
The table above provides basic statistical details (percentiles, mean, std, etc.) of the data.
All feature variables are numerical.

Observation on each column are mentioned below

P_incidence - 75% of values are below 72 while the maximum observed is 129.83, so we should plot a boxplot to check for outliers. Mean and median are nearly equal, so the distribution may be normal.

P_tilt - Mean and median are nearly equal, so the distribution may be normal. This feature (column) contains negative values.

L_angle - Mean and median are nearly equal, so the distribution may be normal. Since the max differs noticeably from the 75th percentile, there might be some outliers.

S_slope - Mean and median are nearly equal, so the distribution may be normal. 75% of values are below 52, but the maximum is 121.4.

P_radius - Mean and median are nearly equal, so the distribution may be normal.

S_Degree - The gap between the 75th percentile and the max is large, so we should inspect the data for outliers. The mean is noticeably greater than the median, indicating a positively skewed distribution. In general, whenever mean and median differ we should check the skewness of the data.

Let's check the distribution of all feature variables; we can use distplot, boxplot/violinplot and swarm plots for univariate and bivariate analysis.
S_Degree seems to have outliers, and its distribution is positively skewed.
P_incidence, P_tilt, L_angle, S_slope, P_radius - the distributions look normal, with no visible outliers.
To check for outliers explicitly, let's use box plots.
As seen earlier, apart from S_Degree no other column has outliers.
Let's complement the box plots with swarm plots; since we have few rows, swarm plots help visualize the data.
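The box-plot outlier rule and skew check above can also be computed numerically; a sketch (the 1.5 × IQR whisker rule is the standard one a box plot uses, the helper name is ours):

```python
import pandas as pd

def univariate_summary(df):
    """Skewness plus the IQR outlier count per numeric column -- the same
    1.5 * IQR rule a box plot uses to draw its whiskers."""
    rows = {}
    for col in df.select_dtypes("number").columns:
        s = df[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        outliers = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        rows[col] = {"skew": round(s.skew(), 3), "outliers": int(outliers)}
    return pd.DataFrame(rows).T
```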
Observation

P_incidence ::

There are 3 outliers seen in the data.
Data is normally distributed, with few outlier values.

P_tilt ::

There are a few outliers at both the negative and positive ends.
The distribution is slightly right-skewed.

L_angle ::

Data is normally distributed; there is some skewness due to one outlier.

S_slope ::

Data is normally distributed; there is some skewness due to one outlier.

P_radius ::

Here we see more outliers - about 11 are observed in the data.

S_Degree ::

The right skew in the data is due to the larger number of outliers.
About 10 outliers are observed in the data.

Summary: S_Degree seems to have outliers and a positively skewed distribution. For P_incidence, P_tilt, L_angle, S_slope and P_radius the distribution looks normal.
Let's complement the box plots with swarm plots split by the Class variable; since we have few rows, swarm plots help visualize the data.
Class "Type_S" shows the widest spread of data, and has the most data points.
Classes "Type_H" and "Normal" have similar spreads.
Most outliers in the dataset belong to the "Type_S" class; to check this further, let's plot boxplots as well.

P_incidence :

The Type_S class has a wider distribution of data, and outliers are observed in this class.
The Normal class values are slightly higher than those of the Type_H class.


P_tilt :

There are no outliers observed in the Type_S class, but outliers are observed in the other two classes.

L_angle

We can see that the Normal class has higher values compared to the Type_H class.
Each class has outliers.

S_slope

The Normal and Type_S classes have a few outliers.


P_radius

Each class has a few outliers.
Extreme values are seen for the Type_S class.

S_Degree

S_Degree has extreme values for the Type_S class.
The distributions for Normal and Type_H look almost the same; for these two classes alone, initial analysis cannot tell whether S_Degree plays a significant role.


Type_S has a wide range of data distribution in general.
To get a better view of the data, let's draw a heat map.
The heat map clearly shows the correlations between variables; the data shows a mix of correlations.
P_incidence correlates well with the other variables: P_tilt, L_angle, S_slope and S_Degree.
S_Degree and P_incidence are highly correlated.
P_radius has a weak, negative correlation with the other variables.
Class "Type_S" generally has a wider spread when one variable is compared with another.
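The heat map itself would typically come from seaborn (`sns.heatmap(df.corr(), annot=True)`); a small helper (ours, for illustration) extracts the strongly correlated pairs the observations above refer to:

```python
import pandas as pd

def strong_pairs(df, threshold=0.6):
    """Pairs of numeric features whose absolute Pearson correlation meets
    the threshold -- the numbers behind an annotated correlation heat map.
    The 0.6 default is an illustrative cut-off, not the notebook's."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    return [(cols[i], cols[j], round(corr.iloc[i, j], 2))
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if abs(corr.iloc[i, j]) >= threshold]
```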

Clusters can be observed in the data set.
    Class "Type_S"  - clusters can be observed for S_Degree vs the other features
                    - clusters can be observed for P_incidence vs the other features
                    - clusters can be observed for S_slope vs the other features
    Class "Type_H"  - clusters can be observed for S_slope vs the other features
                    - clusters can be observed for L_angle vs the other features
    Class "Normal"  - has an overall spread

S_Degree may have more influence on the classification, while P_tilt seems to have less.



Relationship between variables (well observed in the heat map):

P_incidence vs P_tilt, L_angle, S_slope and S_Degree: positively correlated.

S_Degree is positively correlated with almost all other variables.

L_angle and S_slope are positively correlated.
Checking the class variable "Normal".
We may need outlier treatment, so let's check the mean values of each class for all independent variables.

Data pre-processing:

Segregate predictors vs target attributes
Perform normalisation or scaling if required.
Check for target balancing. Add your comments.
Perform train-test split.
Outlier treatment is done in the cell below.
As seen in the table above, there are no outliers remaining; let's verify this numerically.
For any distance-based algorithm, it is generally recommended to scale the data.
The classes in the target variable (y_data) are not balanced:
observations in one class are fewer than in the other classes.
Algorithms sometimes produce unsatisfactory classifiers when faced with imbalanced datasets.
Since the data is not significantly imbalanced, we can use the dataset as is.
Let's check the shape of the train and test datasets.
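The scaling-plus-split step can be sketched as follows (a minimal illustration; the 30% test size and the stratified split are assumptions consistent with the surrounding text):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare(X, y, test_size=0.3, random_state=1):
    """Stratified train-test split, then scaling -- KNN is distance based,
    and fitting the scaler on the training portion only avoids leakage."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```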

Model training, testing and tuning:

 Design and train a KNN classifier.
 Display the classification accuracies for train and test data.
 Display and explain the classification report in detail.
 Automate the task of finding best values of K for KNN.

 Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained 
 model with your comments for selecting this model. 
This model has a good accuracy of 0.88 on the train data.
Precision and recall are high for all classes, which signifies that both kinds of misclassification are rare: telling a patient who has the disease that they do not, and telling a patient who does not have the disease that they do.
This model has a good accuracy of 0.80 on the test data.
Precision and recall are high for all classes here as well, so the same holds on unseen data.
The numbers 3, 4 and 1 in the lower off-diagonal of the confusion matrix are small, which signifies that misclassifying a diseased patient as healthy is rare.
The best model can be obtained using the parameters mentioned above.
MSE is lowest for k = 5, which also signifies the value of k to use in the model.
As seen in the graph above, the test accuracy never drops below 0.70 for k up to 50.
With cross-validation we get the best K value as 11.
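Automating the search for the best K can be sketched with cross-validation (the odd-only k range is illustrative; the notebook may have scanned a different range):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X, y, k_values=range(1, 31, 2), cv=5):
    """Cross-validate KNN over odd k values (odd avoids ties) and return
    the best k together with the full score table."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=cv).mean()
              for k in k_values}
    return max(scores, key=scores.get), scores
```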

Conclusion and improvisation:

• Write your conclusion on the results.
• Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points 
collected by the research team to perform a better data analysis in future.

Conclusion:

For the classification/prediction of a patient's condition, this model is suitable when used with k = 5.
The model has a good accuracy of 0.80 on the test data.
Precision and recall are high for all classes, which signifies that both kinds of misclassification are rare: telling a patient who has the disease that they do not, and telling a patient who does not have the disease that they do.

Improvement: Patient history can also be collected to improve the model. Data on other common conditions, such as diabetes and blood pressure, could also be obtained for better predictions.

Project 2

DOMAIN: Banking and finance

CONTEXT: A bank X is on a massive digital transformation journey across all its departments. The bank has a growing customer base where the majority are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in expanding its borrower base rapidly to bring in more business via loan interest. A campaign the bank ran last quarter showed an average single-digit conversion rate. With digital transformation being the core strength of the business strategy, the marketing department wants to devise effective campaigns with better target marketing to raise the conversion ratio to double digits with the same budget as the last campaign.

DATA DESCRIPTION:

The data consists of the following attributes:

  1. ID: Customer ID
  2. Age: Customer’s approximate age.
  3. CustomerSince: Customer of the bank since. [unit is masked]
  4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
  5. ZipCode: Customer’s zip code.
  6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
  7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
  8. Level: A level associated to the customer which is masked by the bank as an IP.
  9. Mortgage: Customer’s mortgage. [unit is masked]
  10. Security: Customer’s security asset with the bank. [unit is masked]
  11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
  12. InternetBanking: if the customer uses internet banking.
  13. CreditCard: if the customer uses bank’s credit card.
  14. LoanOnCard: if the customer has a loan on credit card.

Import and warehouse data:

Import all the given datasets and explore shape and size of each.
Merge all datasets into one and explore the final shape and size.
After merging the two datasets we get a 14-column dataset with 5000 rows.
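A sketch of this merge, assuming ID is the shared key between the two given files (the key name is an assumption based on the data description):

```python
import pandas as pd

def merge_on_id(df_a, df_b, key="ID"):
    """Inner-join the two files on the (assumed) shared customer ID column."""
    merged = pd.merge(df_a, df_b, on=key, how="inner")
    print("merged shape:", merged.shape)
    return merged
```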

Data cleansing:

Explore and if required correct the datatypes of each attribute.    
Explore for null values in the attributes and if required drop or impute values.
LoanOnCard is of float datatype; we can change it to a categorical datatype.
Looking at the sample data and its description, a few categorical variables have int or float datatypes.
Let's take a closer look at the Level, Security, FixedDepositAccount, InternetBanking and CreditCard variables.
These look like categorical variables, so let's convert them.
Check for any null values in the dataset and remove them if present.
Class variable analysis is done below.
There is imbalance in the class variable; let's verify this by plotting it with hue on CreditCard.
This is done specifically to check the effect of the CreditCard variable on LoanOnCard.
Customers both with and without credit cards have taken loans, so we need to check the correlation between these variables to decide whether to retain CreditCard for model building.

Data analysis & visualisation:

Perform detailed statistical analysis on the data.
Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis
The table above provides basic statistical details (percentiles, mean, std, etc.) of the data.
All feature variables are numerical.

Observation on each column are mentioned below

ID:

This column has no significance and can be dropped while building the model.

Age:

This column depicts the customer's age; we need to check its correlation with the class variable.
The maximum observed age is 67.
The average age in the dataset is 45.

CustomerSince:

75% of the values are below 30, signifying that 75% of the bank's customers have a relationship shorter than 30 (the unit is masked).

HighestSpend:

75% of values are below 98, and the gap between the max and the 75th percentile is large, so there could be outliers in this variable; we should check the column with a boxplot.

ZipCode:

This column signifies the customer's area code; we need to check its correlation with the class variable.

HiddenScore:

The mean of this variable is higher than the median, which signifies skewness in the data; when the mean is greater than the median, the distribution tends to be positively skewed.

MonthlyAverageSpend:

The gap between the max and the 75th percentile signifies it may have outliers.

Mortgage:

The gap between the max and the 75th percentile signifies it may have outliers.
Boxplots and distplots help us visualize the data distribution and detect outliers.
For these plots we consider only the continuous variables in this dataset.

Age ::

The data seems to be normally distributed with no outliers. A few modes are observed in the graph.

CustomerSince ::

This variable behaves similarly to Age; it is better to check the correlation between these two variables.

HighestSpend ::

The data is right-skewed, with a few outliers.

MonthlyAverageSpend ::

The data is right-skewed, with more outliers. HighestSpend and MonthlyAverageSpend tend to follow the same kind of distribution; these variables may be strongly correlated.

Mortgage ::

There are many customers with no mortgage; for those who have one, the data is right-skewed.
We can perform bivariate analysis on the variables and check their influence on the class variable.
Age and CustomerSince have no outliers for either class.
HighestSpend has outliers for class "0" and none for the other class.
Mortgage - few customers have a mortgage, while many do not. Even though outliers are observed, we should take a deeper look before deciding on outlier treatment for this column.
MonthlyAverageSpend - has outliers for class "0" and only one for the other class.

For all of these variables, it is better to inspect the data closely and take value counts before deciding on outlier treatment.
It is better to draw a pairplot of the non-categorical data, which helps with multivariate analysis.
Here we can ignore the ZipCode variable, as it shows no significant relationship with the other variables.
Comparing Mortgage and HighestSpend, people with high mortgages tend to have higher spend.
Comparing MonthlyAverageSpend and HighestSpend, people with high MonthlyAverageSpend tend to have higher spend.
Age and CustomerSince show linear collinearity, which is somewhat obvious and is also depicted in the data.
Let's check the correlation between variables, which will help in determining the columns required for model building.
The data needs more analysis, and it is easier to visualize as a heat map.
Let's also check the data distribution of all columns from the raw dataset.
From the above set of graphs we can observe that:
1) There is no correlation between CreditCard and LoanOnCard; we may need to exclude this variable.
2) The ID column can be ignored, as it has no significance.
3) ZipCode has no correlation with the LoanOnCard variable.
4) The variables below have low correlation with LoanOnCard:
    CustomerSince
    HiddenScore
    Security
    InternetBanking

    We need to check the model's accuracy when using these variables.

 5) There is good correlation between MonthlyAverageSpend and HighestSpend.
 6) Age and CustomerSince are highly correlated, hence we can consider only one of them when building the model.
Response variable distribution check:
    There is imbalance in the dataset, which needs to be corrected before building the model for better results.
    Let's check the class summary.

Data pre-processing:

• Segregate predictors vs target attributes
• Check for target balancing and fix it if found imbalanced.
• Perform train-test split.
Based on the correlations, we consider the variables below for model building:
Age, HighestSpend, HiddenScore, MonthlyAverageSpend, Level, Mortgage, Security, FixedDepositAccount, InternetBanking
As observed in the EDA section, there are outliers in the variables considered for model building.
Let's check the data using value counts to decide on treating them.
After checking all 3 significant columns and their value counts, we can keep the data as is: the outlier values repeat, and the values with low counts are within range (i.e. less than the max value).
Perform a train-test split with 30% test size.
We apply stratify to get an even class distribution.
Let's check whether the target distribution is well preserved in the train and test sets.
Balance the dataset by oversampling the minority class.
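The oversampling step can be sketched with simple resampling (SMOTE would be an alternative; this sketch uses plain upsampling with replacement, applied to the training split only):

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(X_train, y_train, random_state=1):
    """Upsample the minority class in the training data only, so the test
    set keeps the original (imbalanced) class distribution."""
    data = pd.concat([X_train, y_train], axis=1)
    target = y_train.name
    counts = data[target].value_counts()
    minority = data[data[target] == counts.idxmin()]
    majority = data[data[target] == counts.idxmax()]
    upsampled = resample(minority, replace=True,
                         n_samples=len(majority), random_state=random_state)
    balanced = pd.concat([majority, upsampled])
    return balanced.drop(columns=target), balanced[target]
```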

Model training, testing and tuning:

• Design and train a Logistic regression and Naive Bayes classifiers.
• Display the classification accuracies for train and test data.
• Display and explain the classification report in detail.
• Apply all the possible tuning techniques to train the best model for the given data. Select the final best trained model with your comments for selecting this model. 
Naive Bayes is covered at the end of this section.
For the train data: train accuracy is 90%, a good score for the model.

As seen in the classification report, recall is high, which matters for this model since we don't want to lose a customer who has the potential to repay the loan.

The model also has good precision, which signifies which customers should be avoided.

We should check the confusion matrix for details on truth vs prediction counts.
For the test data: test accuracy is 87%, a good score for the model.

As seen in the classification report, recall is high here too, which matters since we don't want to lose a customer who has the potential to repay the loan.
The recall is almost the same as on the train data.

The model also has good precision, which signifies which customers should be avoided.

We should check the confusion matrix for details on truth vs prediction counts.
This signifies that all the variables considered are important for model building.
The confusion matrix

True Positives (TP): 134 correctly predicted conversions

True Negatives (TN): 1175 correctly predicted non-conversions

False Positives (FP): 13 falsely predicted positives (Type I error)

False Negatives (FN): 172 falsely predicted negatives (Type II error)
Let's find the best parameters for the model using GridSearchCV from sklearn.
Solver: the algorithm to use in the optimization problem.
        Here we choose liblinear, since it is a good choice for small datasets; our observations agree.
Penalty: regularization works by biasing coefficients towards particular values.
        L2 regularization adds a penalty equal to the square of the magnitude of the coefficients.
C: the inverse of regularization strength; it is a control parameter.
        Here the value 10 is chosen.
The same parameters have been used in the earlier logistic regression classifier.
Let's check the model performance with cross-validation.
The mean accuracy is 90%, which matches what we achieved with the earlier model.
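The grid search described above can be sketched as follows (the grid values are illustrative, built around the solver, penalty and C parameters discussed; the notebook's exact grid may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_logreg(X, y):
    """Grid-search solver/penalty/C for logistic regression with 5-fold CV
    and return the best parameter set and its mean accuracy."""
    grid = {"solver": ["liblinear"],          # good choice for small datasets
            "penalty": ["l1", "l2"],          # both supported by liblinear
            "C": [0.01, 0.1, 1, 10, 100]}     # inverse regularization strength
    search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```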
Naive Bayes classifier
For the test data: test accuracy is 85%.

As seen in the classification report, recall is high (around 78%), which matters for this model since we don't want to lose a customer who has the potential to repay the loan.
The recall is almost the same as on the train data.

We should check the confusion matrix for details on truth vs prediction counts.
The confusion matrix

True Positives (TP): 115 correctly predicted conversions

True Negatives (TN): 1165 correctly predicted non-conversions

False Positives (FP): 32 falsely predicted positives (Type I error)

False Negatives (FN): 182 falsely predicted negatives (Type II error)
Let's check the ROC curve and AUC to assess model performance.
A common rule of thumb is that a model with AUC above 0.75 is good.
Here we have a good AUC value, which signifies that the model can be used.
Since a threshold of 0.53 has been observed, we should use 0.53 as the classification threshold: if a customer's predicted probability is above 0.53, classify them as class 1 (will avail the loan); otherwise class 0 (won't avail the loan).
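Computing the AUC and picking a probability threshold can be sketched as follows (Youden's J is one common criterion for arriving at a cut-off such as 0.53; the notebook's exact method may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_threshold(y_true, y_prob):
    """Return the ROC AUC plus the threshold that maximises TPR - FPR
    (Youden's J statistic) along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    best = int(np.argmax(tpr - fpr))  # point farthest above the diagonal
    return roc_auc_score(y_true, y_prob), thresholds[best]
```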

Conclusion and improvisation:

• Write your conclusion on the results.
• Detailed suggestions or improvements or on quality, quantity, variety, velocity, veracity etc. on the data points collected by the Bank to perform a better data analysis in future.

We built logistic regression and naive Bayes models; logistic regression is better suited for this classification since it has better accuracy and recall.

As observed in the confusion matrix, Type I and Type II errors are lower for logistic regression, so this model should give better results when used in production.

The model correctly predicts true positives and true negatives, and the comparison of the numbers shows the model can be used.

If a customer's predicted probability is above 0.53, we classify them as class 1 (will avail the loan); otherwise as class 0 (won't avail the loan).

Improvements
To run insight-driven marketing campaigns, banks must think about customised recommendations for individual needs.

This needs collection of more customer data; customer purchase history would help here in getting more insight.

If possible, customer behavioural and psychographic data can be collected for better customer segmentation.

If allowed, customer social media data can be analysed to give insight into customer desires.

If possible, collecting the customer's salary and type of work (salaried employee, self-employed) would also help.